Data science workflow


AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science

Luo, An, Xian, Xun, Du, Jin, Tian, Fangqiao, Wang, Ganghua, Zhong, Ming, Zhao, Shengchun, Bi, Xuan, Liu, Zirui, Zhou, Jiawei, Srinivasa, Jayanth, Kundu, Ashish, Fleming, Charles, Hong, Mingyi, Ding, Jie

arXiv.org Artificial Intelligence

Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models' ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems. Our data and code are publicly available here: https://github.com/jeremyxianx/Assisted-DS


AI, Humans, and Data Science: Optimizing Roles Across Workflows and the Workforce

Timpone, Richard, Yang, Yongwei

arXiv.org Artificial Intelligence

AI is being leveraged to construct surveys, synthesize data, conduct analysis, and write summaries of the results. While the promise is to create efficiencies and increase quality, the reality is not always as clear cut. Leveraging our framework of Truth, Beauty, and Justice (TBJ), which we use to evaluate AI, machine learning, and computational models for effective and ethical use (Taber and Timpone 1997; Timpone and Yang 2024), we consider the potential and limitations of analytic, generative, and agentic AI to augment data scientists or take on tasks traditionally done by human analysts and researchers. While AI can be leveraged to assist analysts in their tasks, we raise some warnings about push-button automation. Just as earlier eras of survey analysis created some issues when the increased ease of using statistical software allowed researchers to conduct analyses they did not fully understand, the new AI tools may create similar but larger risks. We emphasize a human-machine collaboration perspective (Daugherty and Wilson 2018) throughout the data science workflow and particularly call out the vital role that data scientists play under VUCA decision areas. We conclude by encouraging the advance of AI tools to complement data scientists but advocate for continued training and understanding of methods to ensure the substantive value of research is fully achieved by applying, interpreting, and acting upon results most effectively and ethically.


RAPIDS cuDF to Speed up Your Next Data Science Workflow - KDnuggets

#artificialintelligence

Over the years there has been exponential growth in data science applications, fueled by data collected from a wide variety of sources. In the last 10 years alone we have seen widespread adoption of data science, machine learning, and deep learning. Although we hear far more about machine learning and deep learning, it is the core data science techniques that many companies focus on, as this is where they make and save money. However, studies show that 68% of data studies go unused and 90% of data is left unstructured. This is because companies fail to focus on the data analytical processing phase, as it can take a lot of time, money, and resources.


A Layman's Guide to Data Science Workflow

#artificialintelligence

When you get involved in a data science project, you must first take care of basic elements such as the business objective, domain knowledge, the organization's standard data science practices, and previous experience, while considering the next steps toward a solution: data source identification, data modeling, data management, and data visualization. The data science industry already offers a variety of data science workflow frameworks to solve different kinds of data science problems. It is not possible to develop an all-inclusive data science workflow that solves every business problem. Instead, it is important to follow some standard best practices, such as automating data pipelines, planning inferences, and doing a post-mortem at the end of every project to identify potential improvement areas. You will learn about various standard data science workflows in this article. You will also gain an understanding of the structure of a data science workflow and the considerations to keep in mind as you follow it.
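The "automating data pipelines" practice mentioned above can be sketched as a chain of small, testable steps. The following is a minimal standard-library sketch; the function names and the trivial "model" are illustrative, not taken from the article.

```python
# A minimal automated pipeline: each stage is a plain function, and the
# pipeline is just a list of stages applied in order. Names are illustrative.

def obtain(raw):
    """Acquire raw records (here, a list of dicts standing in for a data source)."""
    return list(raw)

def scrub(records):
    """Drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def model(records):
    """A trivial 'model': the mean of a target field."""
    values = [r["y"] for r in records]
    return sum(values) / len(values)

def run_pipeline(raw):
    data = raw
    for step in (obtain, scrub):
        data = step(data)
    return model(data)

raw = [{"y": 1.0}, {"y": None}, {"y": 3.0}]
print(run_pipeline(raw))  # 2.0 (the None record is scrubbed out)
```

Because each stage is an ordinary function, a post-mortem can inspect or rerun any single step in isolation.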


How to master Streamlit for data science

#artificialintelligence

To build a web app you would typically use a Python web framework such as Django or Flask. But the steep learning curve and the large time investment required to implement these apps present a major hurdle. Streamlit makes app creation as simple as writing Python scripts! In this article, you'll learn how to master Streamlit when getting started with data science. The data science process boils down to converting data into knowledge and insights, with frameworks such as CRISP-DM and OSEMN giving that conversion its structure.


Is Data Science a Dying Career? - KDnuggets

#artificialintelligence

I recently read an article describing data science as an oversaturated field. The article predicted that ML engineers would replace data scientists in the upcoming years. According to the author of this article, most companies worked to solve very similar business problems with data science. Due to this, it wouldn't be necessary for data scientists to come up with novel methods of solving problems. The author went on to say that only basic data science skills were required in order to solve problems in most data-driven organizations.


Book Review: Data Science at the Command Line By Jeroen Janssens

#artificialintelligence

Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools, written by Jeroen Janssens, is the second edition of the series "Data Science at the Command Line". This book demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You will learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 80 tools, useful whether you work with Windows, macOS, or Linux. You will quickly discover why the command line is an agile, scalable, and extensible technology.


5 tips for improving your data science workflow

#artificialintelligence

The most costly data science mistakes stem from flaws in planning and communication. Execution mistakes can cost a day or two to fix, but planning mistakes can take weeks to months to set right. Mathematician and data analysis pioneer John Tukey said "an approximate answer to the right question is better than an exact answer to the wrong question." Machine learning solutions work by optimizing toward an objective function: a mathematical formula that describes some value.
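An objective function of the kind described above can be made concrete with mean squared error, a common choice that a learning algorithm would minimize. This is a standard-library sketch, not code from the article.

```python
# Mean squared error as an objective function: lower is better, 0 is a perfect fit.

def mse(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    assert len(y_true) == len(y_pred)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]
print(mse(y_true, [1.0, 2.0, 3.0]))  # 0.0 (perfect predictions)
print(mse(y_true, [2.0, 2.0, 2.0]))  # ~0.667 (always predicting the mean)
```

Tukey's warning applies directly: a model can drive this number to zero while still answering the wrong question, because the objective only encodes what the formula measures.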


5 lesser-known Python libraries to improve your Data Science workflow

#artificialintelligence

"A star does not compete with other stars around it; it just shines." Python is by far the most popular programming language in the field of Data Science. The rich list of libraries, simple syntax and high productiveness make Python an extremely popular language among beginners as well as seasoned practitioners. Therefore, it is not unusual to find countless articles praising the power of Python and it's famous data science libraries like Numpy, Pandas, Tensorflow, Matplotlib, etc. This blog will try to divert attention to look at some of the lesser-known Python libraries that are slowly gaining recognition among the Data Science community. Streamlit has been gaining tremendous popularity in recent times.


ml-ops.org

#artificialintelligence

In this section, we provide a high-level overview of a typical workflow for machine learning-based software development. Generally, the goal of a machine learning project is to build a statistical model by applying machine learning algorithms to collected data. Therefore, every piece of ML-based software includes three main artifacts: Data, ML Model, and Code. The figure below shows the core steps involved in a typical ML workflow. The initial step in any data science workflow is to acquire and prepare the data to be analyzed.
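The three artifacts named above, and the acquire-and-prepare step, can be sketched in a few lines of standard-library Python. The structure is illustrative under my own naming, not taken from ml-ops.org.

```python
# Data, ML Model, and Code as three explicit artifacts of an ML project.

from dataclasses import dataclass

@dataclass
class Data:
    rows: list  # prepared (x, y) records

@dataclass
class Model:
    slope: float  # parameter of a trivial linear fit through the origin

def acquire_and_prepare(raw):
    """Initial workflow step: collect records and drop malformed ones."""
    return Data(rows=[(x, y) for x, y in raw if x is not None and y is not None])

def train(data):
    """Fit y ~ slope * x by least squares through the origin."""
    num = sum(x * y for x, y in data.rows)
    den = sum(x * x for x, _ in data.rows)
    return Model(slope=num / den)

data = acquire_and_prepare([(1, 2), (2, 4), (None, 5)])
model = train(data)
print(model.slope)  # 2.0
```

Keeping the three artifacts separate like this is what lets each one be versioned and tested independently, which is the central concern of MLOps.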